This distribution applies to cases where observations can be put into one of two groups.
\[ P(x=K|N,p) = \frac{N!}{ K! (N-K)!}p^K(1-p)^{N-K} \]
Used as the generative function for Bayesian inference.
Can be used to test whether a set of data is consistent with an expectation given by the underlying probability of seeing an item in “Group 1” (catfish in this example):
\(H_O: p = \hat{p}\)
For this, the test statistic is defined as the number of observations in “Group 1” (e.g., \(N_{catfish}\) in this example), and what we know about the probability of selecting items that have two distinct states is used to evaluate the likelihood of getting \(N_{catfish}\).
Borrowing from a previous example, let’s assume we are sampling fish in the James River. The hypothesis is that the frequency of catfish is
\[ p_{catfish} = \frac{N_{catfish}}{N_{catfish} + N_{other}} = \frac{37}{50} \approx 0.74 \]
So if we have another set of samples, we can test the null hypothesis:
\(H_O: p = 0.74\)
Using the new counts of \(N_{catfish}\) and \(N_{other}\).
R has a direct function for this, binom.test(), that allows testing two-tailed or single-sided alternative hypotheses.
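A minimal sketch of that call, assuming a hypothetical new sample with 52 catfish (“successes”) out of 99 fish (“trials”); the counts here match the output shown below.

```r
# Exact binomial test of H_O: p = 0.74, given 52 successes in 99 trials.
# The alternative argument ("two.sided", "less", "greater") selects the
# two-tailed or single-sided hypothesis.
fit <- binom.test(x = 52, n = 99, p = 0.74, alternative = "two.sided")
fit$estimate   # observed proportion of successes, 52/99
fit$p.value    # small value -> reject H_O: p = 0.74
```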
So, let’s assume you went out and sampled new data and found
```
	Exact binomial test

data:  samples
number of successes = 52, number of trials = 99, p-value = 5.007e-06
alternative hypothesis: true probability of success is not equal to 0.74
95 percent confidence interval:
 0.4223973 0.6265570
sample estimates:
probability of success 
             0.5252525 
```
Conclusion?
As usual, the object returned by the function has all the necessary components for you to include the results in your manuscript.
When we have more than two unique states, we can extend the Binomial test to a more generalized Multinomial expansion. This would be used to test hypotheses about specific probabilities for \(K-1\) states (the number of independently assorting groups). However, this is a specific version of a more generalized approach using Contingency Tables, and I won’t be going over it here.
However, there are a few libraries in R available if you’d prefer to use the multinomial expansion.
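If you just need the exact probability of a particular vector of counts, base R already has it in stats::dmultinom(), with no extra library required. The counts and group probabilities below are made-up values for illustration.

```r
# Probability of observing hypothetical counts (30, 15, 5) across three
# groups under hypothesized group probabilities (0.6, 0.3, 0.1)
dmultinom(x = c(30, 15, 5), prob = c(0.6, 0.3, 0.1))
```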
Contingency tables are another counting approach, where your data can be classified into one or more categories. Often it is taken as a comparison between populations. So, in keeping with our fish theme, let’s assume we went to another river and did another sampling run on catfish. In general, we are not limited to “Group A” vs “Group B” but we will use it for continuity here.
This is defined as an \(R \times C\) Contingency Table, where \(R\) is the number of rows and \(C\) is the number of columns.
In this example, I am going to assume that the rows are independent sampling events and the columns represent the grouping of individual sample observations.
For this, we are assuming that the observations in the row are taken from a larger background population.
The same applies for samples from Population 2.
\(H_O: p_{catfish, \; pop1} = p_{catfish, \; pop2}\)
If we had more than just two Groups, we expand these to the vector of probabilities by row.
\(H_O: \vec{p}_i = \vec{p}_j\)
And the same applies for more than just two rows.
The test statistic for this is based upon:
The distance between the observed and expectations for the data.
Observed
The count of items in each combination of categories: \(O_{11}\), \(O_{21}\), etc.
Expected
Based on number of categories OR external information.
Let’s assume I went out and sampled 50 more fishes and my overall data are now:
If we have no data, and the underlying hypothesis is that samples may be put into any of the Groups randomly (e.g., the underlying assumption that all fish species have equal abundance), then the expectation is that the frequencies of observations would be \(\frac{1}{K}\), where \(K\) is the number of groups.
In this example, we have a prior expectation that:
\[ p_{catfish} = 0.74 \]
\[ p_{other} = 1.0 - 0.74 = 0.26 \]
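Under this prior expectation, the expected counts are just the sample size times these probabilities; a quick sketch for a hypothetical sample of \(N = 50\) fish:

```r
# Expected counts under H_O: p_catfish = 0.74 for a sample of N = 50 fish
N <- 50
E <- N * c(Catfish = 0.74, Other = 0.26)
E   # Catfish 37, Other 13
```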
The test statistic should measure how close the data in these two matrices are.
This is based upon an approximation to the \(\chi^2\) distribution, which is well behaved and described for counting data.
\[ T = \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
I’m going to show you how to estimate this from the raw count data so we can get an example where you have to look up the probability for an underlying distribution. There is a chisq.test() function, but you have to be very careful that you have it set up properly for contingency table analyses.
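Here is a sketch of that by-hand estimation using a hypothetical \(2 \times 2\) table of counts, with the expected counts taken from the row and column totals under independence:

```r
# Hypothetical observed counts: rows are sampling events (populations),
# columns are species groups
O <- matrix(c(37, 13,
              25, 25),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("Population 1", "Population 2"),
                            c("Catfish", "Other")))

# Expected counts under independence: (row total * column total) / grand total
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Test statistic: sum over all cells of (O - E)^2 / E
T_stat <- sum((O - E)^2 / E)
T_stat
```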
The general contingency table approach is based upon \(df = (r-1)(c-1)\) independent elements and is distributed as a \(\chi^2\) random variable. In a classic ‘hypothesis testing’ framework with \(\alpha = 0.05\), we would estimate the critical value of the test statistic as:
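For a \(2 \times 2\) table, \(df = (2-1)(2-1) = 1\), and the critical value comes from the quantile function of the \(\chi^2\) distribution:

```r
# Critical value of the chi-square distribution at alpha = 0.05 with df = 1
qchisq(p = 1 - 0.05, df = 1)   # ~3.84
```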
To get the actual probability associated with \(T = 26.33\), we pass the test statistic to the cumulative distribution function for the \(\chi^2\) distribution and take the upper tail.
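A minimal sketch of that lookup, assuming the table here gives \(df = 1\):

```r
# P-value: probability of a chi-square value at least as large as T = 26.33
pchisq(26.33, df = 1, lower.tail = FALSE)
```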
```
  Sample Species         Site Measurement
1      1 Catfish Population 1    18.01923
2      2 Catfish Population 1    12.63293
3      3 Catfish Population 1    12.29715
4      4 Catfish Population 1    12.82386
5      5 Catfish Population 1    11.64118
6      6 Catfish Population 1    12.05839
```
The table() function can take the columns of factors and provide for you the observed matrix of counts. If you name your factors descriptively, it will take care of the row and column headers for you.
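A sketch of that, assuming a data frame like the one printed above with descriptively named factor columns (the toy values here are made up):

```r
# Toy data frame with factor columns for species and sampling site
fish <- data.frame(
  Species = factor(c("Catfish", "Catfish", "Other", "Catfish", "Other")),
  Site    = factor(c("Population 1", "Population 2", "Population 1",
                     "Population 1", "Population 2"))
)

# Observed matrix of counts; row and column headers come from the factors
table(fish$Species, fish$Site)
```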
There are equivalent tests (ok, not equivalent, but they can be used to answer a similar question) that do not have assumptions of normality, homoscedasticity, etc.
Important
These are largely based upon ranks and combinatorial theory.
You gain broader applicability at the potential cost of information (and hence statistical power).
All of these methods are based upon ranks.
The normal cor.test() function has an option to swap out the Pearson product-moment approach for one based on ranks.
I’m going to use the Iris data set to look at a correlation for relatively poorly behaved data.
```r
fit_pearson  <- cor.test( iris$Sepal.Length, iris$Sepal.Width )
fit_kendall  <- cor.test( iris$Sepal.Length, iris$Sepal.Width, method = "kendall" )
fit_spearman <- cor.test( iris$Sepal.Length, iris$Sepal.Width, method = "spearman" )

data.frame( Model = c("Pearson", "Kendall", "Spearman"),
            Parameter = c("r", "tau", "rho"),
            R = c(fit_pearson$estimate, fit_kendall$estimate, fit_spearman$estimate),
            P = c(fit_pearson$p.value, fit_kendall$p.value, fit_spearman$p.value) ) -> df
```

Here is the difference in the models (the ties impacted the Spearman p-value estimate).
There are a few different methods for estimating linear regression models via non-parametric approaches. However, they’d require a bit more background on counting statistics and I’m going to pass on this here.
If you need to do non-parametric regression, come see me and we’ll spend a bit of time getting the foundation set for it.
This is a rank-based counterpart of the t-test for the equivalence of the mean values of two data sets (e.g., \(H_O: \bar{x} = \bar{y}\)).
This approach has the following assumptions: